{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "\n",
    "<img src=\"https://www.tibco.com/blog/wp-content/uploads/2015/08/tibco-logo-620x360.jpg\" style=\"float: left; width: 30%; margin-right: 1%; margin-bottom: 0.5em;\"><img src=\"https://www.skylinelabs.in/blog/images/tensorflow.jpg\" style=\"float: center; width: 40%; margin-right: 1%; margin-bottom: 0.5em;\">\n",
    "<p style=\"clear: both;\">"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Human Activity Detection using TensorFlow and Flogo\n",
    "\n",
    "Author : Venkata Jagannath, Data Scientist, TIBCO Software\n",
    "\n",
    "Date : October 17, 2017\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import IPython"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Deep neural networks and TensorFlow\n",
    "\n",
    "### Introduction to neural networks\n",
    "\n",
    "Neural networks are part of a specialized category of machine learning called deep learning. These advanced models are used in supervised and unsupervised learning tasks to find non-linear patterns between a set of variables. \n",
    "\n",
    "\n",
    "\n",
    "### TensorFlow\n",
    "\n",
    "TensorFlow is an open source deep learning library first released by Google in Nov 2015. It has since become the most popular library used for both development and production tasks. The backend allows users to deploy tasks to muiltple CPUs or GPUs. TensorFlow models can be outputed as [protocol buffer](https://developers.google.com/protocol-buffers/docs/overview) files.\n",
    "\n",
    "\n",
    "### Introduction\n",
    "\n",
    "#### Problem\n",
    "\n",
    "Sensors such as accelerometers produce several records of data every second. As sensors become more commonplace, having these devices communicate with web servers will give rise to latency and bandwidth issues. Data privacy is also a big concern. To address these problems, there is a need to let the data remain on the edge device and make decisions on the device itself. This approach has the following advantages -\n",
    "\n",
    "* Speed of execution - Since there are no latency & bandwidth issues, the speed of execution will greatly improve.\n",
    "* Cost of maintenance - MNCs can avoid the cost of setting up and maintaining huge servers to store and process data\n",
    "* Usage during network disruptions - The devices can be used even during network disruptions\n",
    "* Data privacy - Since the data does not leave the device, the issue of data privacy does not arise.\n",
    "\n",
    "\n",
    "#### Goal\n",
    "\n",
    "Our objective with this code is to learn a classifier that can accurately predict a human activity based on accelerometer readings. We will also need to output a protobuffer file that can be deployed directly on the edge device for scoring at data source.\n",
    "\n",
    "\n",
    "#### Approach\n",
    "\n",
    "For this project, we will be using the SEMMA approach - **S**ample the data, **E**xplore the sample for patterns, **M**odify the data, Build predictive **M**odels & **A**nalyze the results"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Import all necessary python packages\n",
    "\n",
    "python packages including tensorflow can be installed from the command line using -\n",
    "\n",
    "pip install tensorflow"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import tensorflow as tf\n",
    "from sklearn import metrics"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Sample\n",
    "\n",
    "The data collected from an accelerometer contains three columns - 'x', 'y' & 'z'. The accelerometer outputs 20 records of data every second. An intial sample of 21 seconds of data for each activity is collected.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Specify the Inputs:\n",
    "\n",
    "**'data_location' :** location of the data files\n",
    "\n",
    "**'model_output_loc' :** location where the protoBuffer file must be saved\n",
    "\n",
    "**'hidden_units' :** Number of hidden layers in DNN Classifier model and number of neurons in each layer.\n",
    "\n",
    "**'learn_rate' :** Learning rate for the neural network"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "data_location = './tn_training_data/'\n",
    "model_output_loc = './models/TB/'\n",
    "hidden_units = [100, 40, 3]\n",
    "learn_rate = 0.01"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explore & Modify\n",
    "\n",
    "While exploring the above 21 seconds of data, we can observe a pattern for each activity. \n",
    "\n",
    "Autocorrelation plots over a period of values provide insights on how correlated a value at time t1 is to a value at time t2. \n",
    "\n",
    "Based on the insights obtained from the a TIBCO Spotfire analysis, we can conclude that a time lag of 10 is likely to capture significant variations between the three activities. \n",
    "\n",
    "We collect 5 training samples for each of these activities while altering the orientation of the accelerometer device. \n",
    "\n",
    "Using this information to create a final training dataset using the following steps -\n",
    "\n",
    "* Read in all five datasets for each activity\n",
    "* For each dataset of each activity - \n",
    "    * create 10 new features for each of 'x','y' & 'z' (lag - 1 to 10)\n",
    "    * Drop 'na' values in the temporary dataframe\n",
    "    * Append to final data frame\n",
    "* Return final data frame to be use for training our model"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "def training_data(path,label,lag):\n",
    "    \n",
    "    filepath = path\n",
    "    data = pd.DataFrame()\n",
    "    \n",
    "    for file in [\"\",\"2\",\"3\",\"4\",\"5\"]:\n",
    "        \n",
    "        path = filepath + label.lower() +file +\".csv\"\n",
    "        \n",
    "        column_names = ['x', 'y', 'z']\n",
    "        temp = pd.read_csv(path,header = None, names = column_names)\n",
    "        temp['activity'] = label\n",
    "        \n",
    "        for i in range(1,lag+1):\n",
    "            temp['x'+str(i)] = temp['x'].shift(-1 * i)\n",
    "            temp['y'+str(i)] = temp['y'].shift(-1 * i)\n",
    "            temp['z'+str(i)] = temp['z'].shift(-1 * i)\n",
    "        \n",
    "        temp = temp.dropna()\n",
    "        \n",
    "        if data.empty:\n",
    "            data = temp\n",
    "        else:\n",
    "            data = data.append(temp)\n",
    "    \n",
    "    return data\n",
    "\n",
    "    \n",
    "dataset = training_data(data_location,\"jogging\",10)\n",
    "dataset = dataset .append(training_data(data_location,\"Walking\",10))\n",
    "dataset = dataset .append(training_data(data_location,\"Standing\",10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Separate the final dataset into training & test data (80-20 split) "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "msk = np.random.rand(len(dataset)) < 0.8\n",
    "train = dataset[msk]\n",
    "test = dataset[~msk]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Create variables for number of columns, continuous cols, label col & the label names"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "num_labels = train['activity'].nunique()\n",
    "cont_cols = [x for x in list(train.columns.values) if x != 'activity']\n",
    "lab_col = 'activity'\n",
    "label_names = list(train['activity'].unique())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Use a function to convert a pandas dataframe to tensors"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": [
    "tf.logging.set_verbosity(0)\n",
    "\n",
    "def get_input_fn_from_pandas(data_set, num_epochs=None, shuffle=False):\n",
    "    \n",
    "    return tf.estimator.inputs.pandas_input_fn(x=data_set[cont_cols],y=data_set[lab_col],num_epochs=num_epochs,shuffle=shuffle)\n",
    "\n",
    "feat_cols = [tf.feature_column.numeric_column(feat) for feat in cont_cols]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Train DNNClassifier model\n",
    "\n",
    "Fit a DNN classifer with the above specified hidden layers and learn rate"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 8,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "<tensorflow.python.estimator.canned.dnn.DNNClassifier at 0x290a4c4e550>"
      ]
     },
     "execution_count": 8,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf = tf.estimator.DNNClassifier(model_dir=model_output_loc,hidden_units=hidden_units,\n",
    "                                 feature_columns=feat_cols,n_classes=num_labels,\n",
    "                                 label_vocabulary= label_names,\n",
    "                                 optimizer= tf.train.ProximalAdagradOptimizer(learning_rate=learn_rate,\n",
    "                                                                                         l1_regularization_strength=0.001))\n",
    "clf.train(input_fn=get_input_fn_from_pandas(train),steps=10000)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Evaluate\n",
    "\n",
    "Predict on the 20% unseen dataset"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'accuracy': 0.94217604,\n",
       " 'average_loss': 0.22659916,\n",
       " 'global_step': 10000,\n",
       " 'loss': 28.962204}"
      ]
     },
     "execution_count": 9,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "clf.evaluate(get_input_fn_from_pandas(test,10))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Output .pb file\n",
    "\n",
    "Create a protobuffer file and save it to the model output location specified above"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Tensorflow model saved at : ./models/TB/\n"
     ]
    }
   ],
   "source": [
    "feature_spec = tf.estimator.classifier_parse_example_spec(feat_cols,label_key=lab_col)\n",
    "serving_input_fn = tf.estimator.export.build_parsing_serving_input_receiver_fn(feature_spec)\n",
    "servable_model_path = clf.export_savedmodel(model_output_loc, serving_input_fn, as_text=False)\n",
    "\n",
    "\n",
    "print (\"Tensorflow model saved at : \" + model_output_loc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "collapsed": true
   },
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
